In the first post the data were collected. Now we want to get a first impression of the collected data. What is the structure of the data, how are they distributed and, most importantly, do they still need pre-processing?
In the following I will present the data in different ways. In technical jargon this is called descriptive analytics. The following evaluations are only an excerpt of what is possible and do not claim to be complete.
library(plotly)
library(tidyverse)
library(readr)
library(fs)
After I collected the data in my last post, it was saved in a local .rds file. This provides a high performance way to store data that is processed with R. With the readr package the file can be loaded via read_rds().
HousingData <- readr::read_rds(dir_ls("~/Documents/R/My_Website",
recurse = TRUE) %>%
path_abs() %>%
str_subset("HousingData.Rds"))
With the dplyr function glimpse(), you can quickly create an overview of individual table columns. You get the data type as well as an extract of the data in form of the first columns. If we look at the last four columns, we notice that they are all of type character and contain certain additions such as the currency or the area unit. In order to evaluate the data later, however, it is absolutely necessary to have all data in an appropriate format. Accordingly, all unnecessary attributes must be removed and then converted into a numeric or double format.
glimpse(HousingData)
## Observations: 5,383
## Variables: 7
## $ County <chr> "Alb-Donau", "Alb-Donau", "Alb-Donau", "Alb-Donau",...
## $ Title <chr> "NEU//Schloss Dellmensingen //Denkmalschutz //Entwi...
## $ Location <chr> "Werdensteinstraße 25, Erbach, Alb-Donau-Kreis", "L...
## $ Price <chr> "870.000 €", "749.000 €", "519.000 €", "520.000 €",...
## $ LivingArea <chr> "426 m²", "230 m²", "197 m²", "212 m²", "166,17 m²"...
## $ Rooms <chr> "25 ", "6,5 ", "7,5 ", "8 ", "7 ", "9 ", "10 ", "8 ...
## $ SiteArea <chr> "3.000 m²", "1.029 m²", "875 m²", "377 m²", "761 m²...
HousingData_clean <- HousingData %>%
as_tibble() %>%
mutate(Price = as.numeric(str_replace_all(
str_remove_all(Price, "[€.]"), ",", ".")),
LivingArea = as.numeric(str_replace_all(
str_remove_all(LivingArea, "[m².]"), ",", ".")),
Rooms = as.numeric(str_replace(Rooms, ",", ".")),
SiteArea = as.numeric(str_replace_all(
str_remove_all(SiteArea, "[m².]"), ",", ".")))
After the cleanup, all data is now available in the desired, processable format. Additionally we can use the base-function summary() to get a first impression about the distribution of the numerical values.
glimpse(HousingData_clean)
## Observations: 5,364
## Variables: 7
## $ County <chr> "Alb-Donau", "Alb-Donau", "Alb-Donau", "Alb-Donau",...
## $ Title <chr> "NEU//Schloss Dellmensingen //Denkmalschutz //Entwi...
## $ Location <chr> "Werdensteinstraße 25, Erbach, Alb-Donau-Kreis", "L...
## $ Price <dbl> 870000, 749000, 519000, 520000, 497000, 459000, 799...
## $ LivingArea <dbl> 426.00, 230.00, 197.00, 212.00, 166.17, 237.53, 257...
## $ Rooms <dbl> 25.0, 6.5, 7.5, 8.0, 7.0, 9.0, 10.0, 8.0, 7.0, 6.0,...
## $ SiteArea <dbl> 3000, 1029, 875, 377, 761, 380, 1062, 3752, 1040, 5...
summary(HousingData_clean)
## County Title Location
## Length:5364 Length:5364 Length:5364
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Price LivingArea Rooms SiteArea
## Min. : 1 Min. : 0.0 Min. : 1.000 Min. : 0.0
## 1st Qu.: 344000 1st Qu.: 140.0 1st Qu.: 5.000 1st Qu.: 325.8
## Median : 489000 Median : 180.0 Median : 6.500 Median : 573.0
## Mean : 583745 Mean : 215.7 Mean : 7.592 Mean : 1459.0
## 3rd Qu.: 690000 3rd Qu.: 250.0 3rd Qu.: 9.000 3rd Qu.: 901.0
## Max. :14700000 Max. :2900.0 Max. :58.000 Max. :976200.0
We look at advertisements for the different counties in Baden Würtemberg. To determine if the number of advertisements per county varies, we group the data per county and sum them up. In the chart below we can see that the Rhine-Neckar district has the most advertisements with 417, followed by Karlsruhe (277) and Esslingen (258). The fewest advertisements are for Tübingen (82) and Emmendingen (47).
HousingData_clean %>%
group_by(County) %>%
summarise(Count = n()) %>%
ungroup() %>%
plot_ly(x = ~Count,
y = ~reorder(County, Count)) %>%
add_bars() %>%
layout(xaxis = list(title = "No of Houses"),
yaxis = list(title = "County"))
To make the first summary with summary() even more detailed, we can create a boxplot for each attribute. For this I like to use the plotly package, because interactive SVG graphics can be created. If you want to know more about boxplots, you can find out more here.
To create our final plot, we create a single plot (p1, p2, p3 and p4) for every attribute and combine them via subplot.
p1 <- HousingData_clean %>%
drop_na(Price) %>%
plot_ly(y = ~Price, type = "box", name = "Price [€]") %>%
layout(showlegend = FALSE)
p2 <- HousingData_clean %>%
plot_ly(y = ~LivingArea, type = "box", name = "Living Area [sqm]") %>%
layout(showlegend = FALSE)
p3 <- HousingData_clean %>%
plot_ly(y = ~SiteArea, type = "box", name = "Site Area [sqm]") %>%
layout(showlegend = FALSE)
p4 <- HousingData_clean %>%
plot_ly(y = ~Rooms, type = "box", name = "Rooms [no]") %>%
layout(showlegend = FALSE)
subplot(p1, p2, p3, p4)
Individual attributes can be viewed in more detail. For example, a box plot can be created for each county and displayed in comparison to the others.
HousingData_clean %>%
drop_na(Price) %>%
plot_ly(x = ~Price, color = ~County, type = "box")
The price is often the determining factor after evaluation. However, this logic is often too brief. We want to know how the price relates to the attributes living area, site area and number of rooms.
We will also apply a linear regression to the data to see how the individual parameters behave throughout. This is achieved by the lm() function. The relevant values are extracted using fitted().
predefine_xaxis <- list(title = "Price [€]")
p1 <- HousingData_clean %>%
drop_na(Price) %>%
plot_ly(x = ~Price,
y = ~LivingArea) %>%
add_markers(color = ~County,
alpha = 0.5,
legendgroup = ~County,
showlegend = T) %>%
add_lines(y = ~fitted(lm(LivingArea ~ Price)),
line = list(color = '#07A4B5'),
name = "LM",
legendgroup = "LM",
showlegend = T) %>%
layout(xaxis = predefine_xaxis,
yaxis = list(title = "Living Area [sqm]"))
p2 <- HousingData_clean %>%
drop_na(Price) %>%
plot_ly(x = ~Price,
y = ~SiteArea) %>%
add_markers(color = ~County,
alpha = 0.5,
legendgroup = ~County,
showlegend = F) %>%
add_lines(y = ~fitted(lm(SiteArea ~ Price)),
line = list(color = '#07A4B5'),
name = "LM",
legendgroup = "LM",
showlegend = F) %>%
layout(xaxis = predefine_xaxis,
yaxis = list(title = "Site Area [sqm]"))
p3 <- HousingData_clean %>%
drop_na(Price) %>%
plot_ly(x = ~Price,
y = ~Rooms) %>%
add_markers(color = ~County,
alpha = 0.5,
legendgroup = ~County,
showlegend = F) %>%
add_lines(y = ~fitted(lm(Rooms ~ Price)),
line = list(color = '#07A4B5'),
name = "LM",
legendgroup = "LM",
showlegend = F) %>%
layout(xaxis = predefine_xaxis,
yaxis = list(title = "Rooms [no]"))
subplot(p1, p2, p3, plotly_empty(),
nrows = 2,
shareX = TRUE, titleY = TRUE)
This post gave us a first insight into the data. This impression is decisive when it comes to gaining added value from data in the form of reliable findings.